Skip to content

feat(server): central audited chokepoint for customer-data drops + CI guard (truehomie hardening)#50

Merged
mastermanas805 merged 2 commits into
masterfrom
feat/drop-chokepoint-audit-guard
Jun 6, 2026
Merged

feat(server): central audited chokepoint for customer-data drops + CI guard (truehomie hardening)#50
mastermanas805 merged 2 commits into
masterfrom
feat/drop-chokepoint-audit-guard

Conversation

@mastermanas805

Copy link
Copy Markdown
Member

What & why

Addresses the OPEN root-cause class of the truehomie-db DROP incident (2026-06-03): an active Pro customer's Postgres DB + role were dropped on the shared postgres-customers cluster by an unidentified path with NO audit_log row. The provisioner was a dumb executorDeprovisionResource dropped whatever (token, provider_resource_id, resource_type) it received and kept no record of its own.

Full enumeration + design + ranked root-cause hypotheses: docs/ci/DATA-INTEGRITY-DROP-PATH-AUDIT.md (docs repo).

Changes

  1. guardedDrop chokepoint (internal/server/drop_chokepoint.go) — the single sanctioned wrapper for a customer-data destruction. All 8 backend Deprovision dispatches in DeprovisionResource (postgres/redis/mongo/queue × shared/dedicated) now route through it. It emits, before the drop:
    • a structured event=provisioner.drop audit log line: token, provider_resource_id, resource_type, backend, request_id, caller (gRPC peer addr — the attribution that was missing in the incident). Always-on, app-layer equivalent of the cluster log_statement='ddl' trap.
    • instant_provisioner_drop_total{resource_type,backend,outcome} (eager) — an abnormal drop rate is now alertable (the incident was a burst of un-attributed drops).
  2. CI guard (internal/server/drop_guard_test.go) — AST-iterating (rule 18). Walks the provisioner source and fails the build if any DROP DATABASE/ROLE/USER SQL literal, ACL DELUSER, mongo dropDatabase, or Database(...).Drop(...) call appears outside a sanctioned deprovision function reached through guardedDrop. A new un-audited drop path cannot merge. Proven non-vacuous (TestDropGuard_FlagsUnsanctionedSite) and comment-safe (TestDropGuard_IgnoresComments); does not flag Collection.Drop (sentinel cleanup, not customer data).

Surgical / safe

Behaviour of what gets purged in the TTL reaper / team-deletion / user-DELETE flows is unchanged — the chokepoint only ADDs the audit + metric around the existing dispatch. The breaker wrapping is preserved. No proto change, no platform-DB access, no flag needed.

The larger fix for invariant (3) — a DeprovisionRequest.deletion_intent proto field + provisioner terminality enforcement (needs platform-DB read, fail-closed) — is designed + filed, not rushed, to avoid breaking the legitimate purges that drop active rows by design.

Verification

  • make gate: green (build + vet + go test ./... -short -count=1 -p 1, all packages).
  • New tests (9): chokepoint ok/error/breaker-open outcomes + metric assertions, caller attribution, and the CI guard (real-tree scan + synthetic positive + comment negative).
  • Rule 25: alert + dashboard tile + catalog row for instant_provisioner_drop_total ship in the matching infra PR.

🤖 Generated with Claude Code

… guard

Addresses the OPEN truehomie-db DROP incident root-cause class: an active
customer's DB+role were dropped by an unidentified path with NO audit trail.
The provisioner was a dumb executor — DeprovisionResource dropped whatever
token it was handed and kept no record of its own.

- guardedDrop (drop_chokepoint.go): the single sanctioned wrapper for a
  customer-data destruction. Every backend Deprovision dispatch in
  DeprovisionResource now routes through it. Emits a structured
  `event=provisioner.drop` audit log line (token, provider_resource_id,
  resource_type, backend, request_id, caller from gRPC peer) BEFORE the drop
  + instant_provisioner_drop_total{resource_type,backend,outcome}. This is the
  always-on, app-layer equivalent of the cluster's log_statement='ddl' trap.

- drop_guard_test.go: AST-iterating CI guard (rule 18) that FAILS the build if
  any DROP DATABASE/ROLE/USER SQL literal, ACL DELUSER, mongo dropDatabase, or
  Database().Drop() call appears outside a sanctioned deprovision function
  reached through guardedDrop. A new un-audited drop path cannot merge. Proven
  non-vacuous: a synthetic un-sanctioned DROP is flagged; a commented DROP and
  a Collection.Drop (not customer data) are not.

Behaviour of WHAT gets purged in the TTL reaper / team-deletion / user-DELETE
flows is unchanged — the chokepoint only ADDs the audit line + metric around
the existing dispatch. The larger deletion-intent proto field + terminality
enforcement (needs platform-DB read, flag-gated) is designed + filed in
docs/ci/DATA-INTEGRITY-DROP-PATH-AUDIT.md.

make gate: green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@mastermanas805 mastermanas805 enabled auto-merge (squash) June 6, 2026 13:37
…002)

staticcheck QF1002: prefer errors.Is over == for error comparison; also more
correct for a potentially-wrapped circuit.ErrOpen.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@mastermanas805 mastermanas805 merged commit 11475eb into master Jun 6, 2026
12 checks passed
mastermanas805 added a commit that referenced this pull request Jun 10, 2026
…migration-safety CI gate (truehomie D3) (#56)

* feat(dropguard): name-convention guard on every customer-data drop + migration-safety CI gate (truehomie D3)

Layer 2 of the truehomie-db DROP hardening (layer 1 = the guardedDrop audit
chokepoint, PR #50). The chokepoint made every sanctioned drop AUDITABLE;
this makes a MIS-TARGETED drop UNEXECUTABLE:

- internal/dropguard: charset+denylist validation of naming tokens and final
  db/user identifiers. Refuses empty/garbage tokens, SQL metacharacters,
  system databases (postgres, template0/1, instant_customers,
  instant_platform, mongo admin/local/config) and admin roles
  (instanode_admin, instant_cust, doadmin, postgres, redis "default").
  Deliberately permissive on token SHAPE (uuid, dashless hex, pool-*, e2e
  cohorts all pass) so no legitimate legacy deprovision can wedge — this
  repo auto-deploys.
- server.guardedDrop: validates the poolident-resolved naming token BEFORE
  dispatch; refusal = error + `provisioner.drop.refused` log event +
  instant_provisioner_drop_total{outcome="refused"}.
- postgres local Deprovision + cleanupProvisionPartial (non-chokepoint
  rollback path), dedicated deprovisionLocal: validate the FINAL constructed
  dbName/username via validateDropTargets right before DROP.
- mongo Deprovision: token guard before connect + per-candidate name guard
  (refused canonical = hard error; refused legacy candidate = skip).
- redis local/dedicated DELUSER: skip any non-tenant-shaped username
  (ACL DELUSER "default" would brick the shared pod).
- scripts/check-migration-safety.sh + ci.yml step: destructive DDL
  (DROP TABLE/DATABASE/SCHEMA/COLUMN/ROLE/USER, TRUNCATE, DELETE FROM
  without WHERE) in migrations/*.sql fails CI unless the file carries
  `-- destructive: acknowledged <reason>`. Self-test fixture suite included.

No behavior change for legitimate deprovision flows (regression tests assert
uuid/dashless/pool/e2e token shapes still reach the backend). New refusal
branches are unit-tested in every package; dropguard itself is 100% covered.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* feat(pool): chokepoint contract on the pool reaper's deprovision dispatch

pool.deprovisionBacking is the ONE customer-infra drop dispatch that does not
pass through server.guardedDrop (no gRPC request). Give it the same contract:
dropguard naming-token validation (refusal = provisioner.drop.refused) and
the `provisioner.drop` audit event emitted BEFORE the backend executes, with
caller="pool_reaper" and the proto-style resource_type strings so one NR
query covers RPC drops and pool-reap drops alike. Without this, every pool
reap of a failed postgres item would page the new customer-db-destructive-ddl
NR alert as an unsanctioned DROP (and was itself an un-audited drop path).

Test fixture fix: fakeRows.Scan filled every string dest with "postgres",
which dropguard now (correctly) refuses as a pool naming token — fill
positionally so pool_token carries a valid pool-token shape.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* fix(dropguard): identifier length bound is an absurdity bound (128), not postgres's 63

The live-cluster CI suite provisions tokens like tok<nano>_<TestName>
(>63 bytes with the usr_ prefix) — postgres truncates such identifiers
CONSISTENTLY on CREATE and DROP, so they round-trip fine in reality, and the
63-byte refusal wedged their Deprovision (4 coverage-job test failures).
Keep the cap purely as an absurdity bound at 128. Verified the four failing
live tests pass against a real postgres 16.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Manas Srivastava <[email protected]>
Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant